Intro

A common workflow for data validation in pointblank involves three basic components:

We always start with create_agent() and define how the data can be reached and also provide some basic rules about how an interrogation of how that data should eventually be carried out. While we are giving the agent some default behavior, we can override some of this on a step-by-step basis when declaring our validation steps. We always end with interrogate() and that function carries out the work of validating the data and generating the all-important reporting.

A simple data validation on a small dataset called small_table

The package contains a few datasets. A really small one for experimenting is called small_table:

pointblank::small_table
## # A tibble: 13 × 8
##    date_time           date           a b             c      d e     f    
##    <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3  2343. TRUE  high 
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
##  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2  1036. FALSE low  
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9   838. FALSE high 
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

Let’s create an agent, give it the small_table, and look at the report.

# Create the agent with `create_agent()`; the `tbl` is given to the agent
agent_1 <- 
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Workshop agent No. 1",
  )

# Printing the `agent` will print the report with the default options
agent_1
Pointblank Validation PlanNo Interrogation Performed
Workshop agent No. 1 tibble small_table

STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT

Okay. Let’s provide a few validation functions.

agent_1 <-
  agent_1 %>%
  col_vals_gte(columns = d, value = 0) %>%
  col_is_logical(columns = e) %>%
  col_is_character(columns = c(b, f)) %>%
  col_is_numeric(columns = d) %>%
  col_vals_in_set(columns = f, set = c("low", "mid", "high")) %>%
  rows_distinct()

agent_1
Pointblank Validation PlanNo Interrogation Performed
Workshop agent No. 1 tibble small_table

STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_vals_gte
 col_vals_gte()

d

0

2
col_is_logical
 col_is_logical()

e

3
col_is_character
 col_is_character()

b

4
col_is_character
 col_is_character()

f

5
col_is_numeric
 col_is_numeric()

d

6
col_vals_in_set
 col_vals_in_set()

f

low, mid, high

7
rows_distinct
 rows_distinct()

The report contains the information about the validation steps but many of the table cells have no data. That area is the interrogation data, and, we haven’t yet used the interrogate() function. Let’s use it now.

agent_1 <- agent_1 %>% interrogate()

agent_1
Pointblank Validation
Workshop agent No. 1

tibble small_table
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_vals_gte
 col_vals_gte()

d

0

13 13
1.00
0
0.00

2
col_is_logical
 col_is_logical()

e

1 1
1.00
0
0.00

3
col_is_character
 col_is_character()

b

1 1
1.00
0
0.00

4
col_is_character
 col_is_character()

f

1 1
1.00
0
0.00

5
col_is_numeric
 col_is_numeric()

d

1 1
1.00
0
0.00

6
col_vals_in_set
 col_vals_in_set()

f

low, mid, high

13 13
1.00
0
0.00

7
rows_distinct
 rows_distinct()

13 11
0.85
2
0.15

2022-10-13 00:03:53 EDT < 1 s 2022-10-13 00:03:53 EDT

Now, we see a report we can use! Let’s go over each of the columns and understand what they mean.

  • STEP: the name of the validation function used for the validation step and the step number.

  • COLUMNS: the names of the target columns used in the validation step (if applicable).

  • VALUES: the values used in the validation step, where applicable; this could be as literal values, as column names, an expression, etc.

  • TBL: indicates whether any there were any changes to the target table just prior to interrogation. A rightward arrow from a small circle indicates that there was no mutation of the table. An arrow from a circle to a purple square indicates that ‘preconditions’ were used to modify the target table. An arrow from a circle to a half-filled circle indicates that the target table has been ‘segmented’.

  • EVAL: a symbol that denotes the success of interrogation evaluation for each step. A checkmark indicates no issues with evaluation. A warning sign indicates that a warning occurred during evaluation. An explosion symbol indicates that evaluation failed due to an error. Hover over the symbol for details on each condition.

  • UNITS: the total number of test units for the validation step

  • PASS: on top is the absolute number of passing test units and below that is the fraction of passing test units over the total number of test units.

  • FAIL: on top is the absolute number of failing test units and below that is the fraction of failing test units over the total number of test units.

  • W, S, N: indicators that show whether the warn, stop, or notify states were entered; unset states appear as dashes, states that are set with thresholds appear as unfilled circles when not entered and filled when thresholds are exceeded (colors for W, S, and N are amber, red, and blue)

  • EXT: a column that provides buttons to download data extracts as CSV files for row-based validation steps having failing test units. Buttons only appear when there is data to collect.

We see nothing in the W, S, and N columns. This is because we have to explicitly set thresholds for those to be active. We’ll do that next…

Data validation with threshold levels

We often should think about what’s tolerable in terms of data quality and implement that into our reporting. Let’s set proportional failure thresholds to the warn, stop, and notify states using the action_levels() function.

# Create an `action_levels` object with the namesake function.
al <- 
  action_levels(
      warn_at = 0.15,
      stop_at = 0.25,
    notify_at = 0.35
  )

# This can be printed for inspection
al
## -- The `action_levels` settings
## WARN failure threshold of 0.15 of all test units.
## STOP failure threshold of 0.25 of all test units.
## NOTIFY failure threshold of 0.35 of all test units.
## ----

We are using threshold fractions of test units (between 0 and 1). For 0.15, this means that if 15% percent of the test units are found to fail (i.e., don’t meet the expectation), then the designated failure state is entered.

Absolute values starting from 1 can be used instead, and this constitutes an absolute failure threshold (e.g., 10 means that if 10 of the test units are found to fail, the failure state is entered).

What are test units? They make up the individual tests for a validation step. They will vary by the validation function used but, in simple terms, a validation function that validates values in a column will have the number of test units equal to the number of rows. A validation function that validates a column type will have exactly one test unit. This is always given in the UNITS column of the reporting table.

Let’s use the action_levels object in a new validation process (based on the same small_table dataset). We’ll make it so the validation functions used will result in more failing test units.

agent_2 <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Workshop agent No. 2",
    actions = al
  ) %>%
  col_is_posix(columns = date_time) %>%
  col_vals_lt(columns = a, value = 7) %>%
  col_vals_regex(columns = b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>%
  col_vals_between(columns = d, left = 0, right = 4000) %>%
  col_is_logical(columns = e) %>%
  col_is_character(columns = c(b, f)) %>%
  col_vals_lt(columns = d, value = 9600) %>%
  col_vals_in_set(columns = f, set = c("low", "mid")) %>%
  rows_distinct() %>%
  interrogate()

agent_2
Pointblank Validation
Workshop agent No. 2

tibble small_tableWARN 0.15 STOP 0.25 NOTIFY 0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_is_posix
 col_is_posix()

date_time

1 1
1.00
0
0.00

2
col_vals_lt
 col_vals_lt()

a

7

13 11
0.85
2
0.15

3
col_vals_regex
 col_vals_regex()

b

^[0-9]-[a-w]{3}-[2-9]{3}$

13 6
0.46
7
0.54

4
col_vals_between
 col_vals_between()

d

[0, 4,000]

13 12
0.92
1
0.08

5
col_is_logical
 col_is_logical()

e

1 1
1.00
0
0.00

6
col_is_character
 col_is_character()

b

1 1
1.00
0
0.00

7
col_is_character
 col_is_character()

f

1 1
1.00
0
0.00

8
col_vals_lt
 col_vals_lt()

d

9600

13 12
0.92
1
0.08

9
col_vals_in_set
 col_vals_in_set()

f

low, mid

13 7
0.54
6
0.46

10
rows_distinct
 rows_distinct()

13 11
0.85
2
0.15

2022-10-13 00:03:53 EDT < 1 s 2022-10-13 00:03:54 EDT

Some notes:

  • the thresholds for the warn, stop, and notify states are presented in the table header; these are defaults for every validation step
  • we now have some indicators of failure thresholds being met (look at the W, S, and N columns); steps 2, 3, 9, and 10 have at least the warn condition
  • it’s possible to have test unit failures but not enter a warn state (look at steps 4 and 8); they still provide CSVs for failed rows but the W indicator circle isn’t filled in

How you set the default thresholds will depend on how strict the measure for data quality is. There might be certain validation steps where we’d like to be more stringent. For the next validation process we will apply the action_levels() function to individual steps, overriding the default setting.

agent_3 <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Workshop agent No. 3",
    actions = al
  ) %>%
  col_is_posix(columns = date_time) %>%
  col_vals_lt(columns = a, value = 7) %>%
  col_vals_regex(columns = b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>%
  col_vals_between(
    columns = d,
    left = 0,
    right = 4000,
    actions = action_levels( # Setting `actions` at the individual
      warn_at = 1,           # validation step. This time, using absolute
      stop_at = 3,           # threshold values (i.e., a single test unit
      notify_at = 5          # failing triggers the `warn` state
    )
  ) %>%
  col_is_logical(columns = e) %>%
  col_is_character(columns = c(b, f)) %>%
  col_vals_lt(columns = d, value = 9600) %>%
  col_vals_in_set(columns = f, set = c("low", "mid")) %>%
  rows_distinct() %>%
  interrogate()

agent_3
Pointblank Validation
Workshop agent No. 3

tibble small_tableWARN 0.15 STOP 0.25 NOTIFY 0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_is_posix
 col_is_posix()

date_time

1 1
1.00
0
0.00

2
col_vals_lt
 col_vals_lt()

a

7

13 11
0.85
2
0.15

3
col_vals_regex
 col_vals_regex()

b

^[0-9]-[a-w]{3}-[2-9]{3}$

13 6
0.46
7
0.54

4
col_vals_between
 col_vals_between()

d

[0, 4,000]

13 12
0.92
1
0.08

5
col_is_logical
 col_is_logical()

e

1 1
1.00
0
0.00

6
col_is_character
 col_is_character()

b

1 1
1.00
0
0.00

7
col_is_character
 col_is_character()

f

1 1
1.00
0
0.00

8
col_vals_lt
 col_vals_lt()

d

9600

13 12
0.92
1
0.08

9
col_vals_in_set
 col_vals_in_set()

f

low, mid

13 7
0.54
6
0.46

10
rows_distinct
 rows_distinct()

13 11
0.85
2
0.15

2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT

Data extracts for failed rows

Those CSV buttons are useful for sharing the report with others since they don’t even need to know R to get some use from those extracts. For the person familiar with R and pointblank, it is possible to get data frames for the failed rows (per validation step).

We can use the get_data_extracts() function to obtain a list of data frames, or, use the i argument to get a data frame available for a specific step. Not all steps will have associated data frames. Also, not all validation functions will produce data frames here (they need to check values in columns).

Let’s use get_data_extracts() on agent_3.

get_data_extracts(agent = agent_3)
## $`2`
## # A tibble: 2 × 8
##   date_time           date           a b             c     d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
## 1 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7  284. TRUE  low  
## 2 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
## 
## $`3`
## # A tibble: 7 × 8
##   date_time           date           a b             c      d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
## 1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
## 2 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
## 3 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7   284. TRUE  low  
## 4 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3   843. TRUE  high 
## 5 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7   834. TRUE  low  
## 6 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8   108. FALSE low  
## 7 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high 
## 
## $`4`
## # A tibble: 1 × 8
##   date_time           date           a b             c      d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
## 1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
## 
## $`8`
## # A tibble: 1 × 8
##   date_time           date           a b             c      d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
## 1 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
## 
## $`9`
## # A tibble: 6 × 8
##   date_time           date           a b             c     d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3 3423. TRUE  high 
## 2 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3 2343. TRUE  high 
## 3 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
## 4 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 5 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 6 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA 2230. TRUE  high 
## 
## $`10`
## # A tibble: 2 × 8
##   date_time           date           a b             c     d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
## 1 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 2 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high

The list components are named for the validation steps that have data extracts (i.e., filtered rows where test unit failures occurred). Let’s get an individual data extract from step 9 (the col_vals_in_set() step, which looked at column f):

get_data_extracts(agent = agent_3, i = 9)
## # A tibble: 6 × 8
##   date_time           date           a b             c     d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3 3423. TRUE  high 
## 2 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3 2343. TRUE  high 
## 3 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
## 4 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 5 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 6 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA 2230. TRUE  high

Getting ‘sundered’ data back (either ‘good’ or ‘bad’ rows)

Sometimes, if your methodology allows for it, you want to use the best part of the input data for something else. With the get_sundered_data(), we use provide an agent object that interrogated the data and what we get back could be:

  • the ‘pass’ data piece (rows with no failing test units across all row-based validation functions)
  • the ‘fail’ data piece (rows with at least one failing test unit across the same series of validations)
  • all the data with a new column that labels each row as passing or failing across validation steps (the labels can be customized).

Let’s make new agent and validate small_table again.

agent_4 <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Workshop agent No. 4"
  ) %>%
  col_vals_gt(columns = d, value = 1000) %>%
  col_vals_between(
    columns = c,
    left = vars(a), right = vars(d), # Using values in columns, not literal vals
    na_pass = TRUE
  ) %>%
  interrogate()

agent_4
Pointblank Validation
Workshop agent No. 4

tibble small_table
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_vals_gt
 col_vals_gt()

d

1000

13 7
0.54
6
0.46

2
col_vals_between
 col_vals_between()

c

[a, d]

13 9
0.69
4
0.31

2022-10-13 00:03:55 EDT < 1 s 2022-10-13 00:03:55 EDT

Get the sundered data piece that contains only rows that passed both validation steps (this is the default piece). This yields 5 of 13 total rows.

agent_4 %>% get_sundered_data()
## # A tibble: 5 × 8
##   date_time           date           a b             c      d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl>  <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-345     3  3423. TRUE  high 
## 2 2016-01-04 00:32:00 2016-01-04     3 5-egh-163     8 10000. TRUE  low  
## 3 2016-01-06 17:23:00 2016-01-06     2 5-jdo-903    NA  3892. FALSE mid  
## 4 2016-01-11 06:15:00 2016-01-11     4 2-dhe-923     4  3291. TRUE  mid  
## 5 2016-01-30 11:23:00 2016-01-30     1 3-dka-303    NA  2230. TRUE  high

Get the complementary data piece: all of those rows that failed either of the two validation steps. This yields 8 of 13 total rows.

agent_4 %>% get_sundered_data(type = "fail")
## # A tibble: 8 × 8
##   date_time           date           a b             c     d e     f    
##   <dttm>              <date>     <int> <chr>     <dbl> <dbl> <lgl> <chr>
## 1 2016-01-05 13:32:00 2016-01-05     6 8-kdg-938     3 2343. TRUE  high 
## 2 2016-01-09 12:36:00 2016-01-09     8 3-ldm-038     7  284. TRUE  low  
## 3 2016-01-15 18:46:00 2016-01-15     7 1-knw-093     3  843. TRUE  high 
## 4 2016-01-17 11:27:00 2016-01-17     4 5-boe-639     2 1036. FALSE low  
## 5 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 6 2016-01-20 04:30:00 2016-01-20     3 5-bce-642     9  838. FALSE high 
## 7 2016-01-26 20:07:00 2016-01-26     4 2-dmx-010     7  834. TRUE  low  
## 8 2016-01-28 02:51:00 2016-01-28     2 7-dmx-010     8  108. FALSE low

We can get all of the input data returned with a flag column (called .pb_combined). This is done by using type = "combined" and that rightmost column will contain "pass" and "fail" values.

agent_4 %>% get_sundered_data(type = "combined")
## # A tibble: 13 × 9
##    date_time           date           a b           c      d e     f     .pb_c…¹
##    <dttm>              <date>     <int> <chr>   <dbl>  <dbl> <lgl> <chr> <chr>  
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-…     3  3423. TRUE  high  pass   
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-…     8 10000. TRUE  low   pass   
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-…     3  2343. TRUE  high  fail   
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-…    NA  3892. FALSE mid   pass   
##  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-…     7   284. TRUE  low   fail   
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-…     4  3291. TRUE  mid   pass   
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-…     3   843. TRUE  high  fail   
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-…     2  1036. FALSE low   fail   
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-…     9   838. FALSE high  fail   
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-…     9   838. FALSE high  fail   
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-…     7   834. TRUE  low   fail   
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-…     8   108. FALSE low   fail   
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-…    NA  2230. TRUE  high  pass   
## # … with abbreviated variable name ¹​.pb_combined

The labels can be changed and this is flexible:

agent_4 %>% get_sundered_data(type = "combined", pass_fail = c(TRUE, FALSE))
## # A tibble: 13 × 9
##    date_time           date           a b           c      d e     f     .pb_c…¹
##    <dttm>              <date>     <int> <chr>   <dbl>  <dbl> <lgl> <chr> <lgl>  
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-…     3  3423. TRUE  high  TRUE   
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-…     8 10000. TRUE  low   TRUE   
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-…     3  2343. TRUE  high  FALSE  
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-…    NA  3892. FALSE mid   TRUE   
##  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-…     7   284. TRUE  low   FALSE  
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-…     4  3291. TRUE  mid   TRUE   
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-…     3   843. TRUE  high  FALSE  
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-…     2  1036. FALSE low   FALSE  
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-…     9   838. FALSE high  FALSE  
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-…     9   838. FALSE high  FALSE  
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-…     7   834. TRUE  low   FALSE  
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-…     8   108. FALSE low   FALSE  
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-…    NA  2230. TRUE  high  TRUE   
## # … with abbreviated variable name ¹​.pb_combined
agent_4 %>% get_sundered_data(type = "combined", pass_fail = 0:1)
## # A tibble: 13 × 9
##    date_time           date           a b           c      d e     f     .pb_c…¹
##    <dttm>              <date>     <int> <chr>   <dbl>  <dbl> <lgl> <chr>   <int>
##  1 2016-01-04 11:00:00 2016-01-04     2 1-bcd-…     3  3423. TRUE  high        0
##  2 2016-01-04 00:32:00 2016-01-04     3 5-egh-…     8 10000. TRUE  low         0
##  3 2016-01-05 13:32:00 2016-01-05     6 8-kdg-…     3  2343. TRUE  high        1
##  4 2016-01-06 17:23:00 2016-01-06     2 5-jdo-…    NA  3892. FALSE mid         0
##  5 2016-01-09 12:36:00 2016-01-09     8 3-ldm-…     7   284. TRUE  low         1
##  6 2016-01-11 06:15:00 2016-01-11     4 2-dhe-…     4  3291. TRUE  mid         0
##  7 2016-01-15 18:46:00 2016-01-15     7 1-knw-…     3   843. TRUE  high        1
##  8 2016-01-17 11:27:00 2016-01-17     4 5-boe-…     2  1036. FALSE low         1
##  9 2016-01-20 04:30:00 2016-01-20     3 5-bce-…     9   838. FALSE high        1
## 10 2016-01-20 04:30:00 2016-01-20     3 5-bce-…     9   838. FALSE high        1
## 11 2016-01-26 20:07:00 2016-01-26     4 2-dmx-…     7   834. TRUE  low         1
## 12 2016-01-28 02:51:00 2016-01-28     2 7-dmx-…     8   108. FALSE low         1
## 13 2016-01-30 11:23:00 2016-01-30     1 3-dka-…    NA  2230. TRUE  high        0
## # … with abbreviated variable name ¹​.pb_combined

Accessing the plan/interrogation data with get_agent_x_list()

The agent’s x-list is a record of information that the agent possesses at any given time. The x-list will contain the most complete information after an interrogation has taken place (before then, the data largely reflects the validation plan).

The x-list can be constrained to a particular validation step (by supplying the step number to the i argument), or, we can get the information for all validation steps by leaving i unspecified.

Let’s obtain such a list from agent_3, which had 10 validation steps:

# Generate the `x_list` object from `agent_3`
x_list_3 <- agent_3 %>% get_agent_x_list

# Printing this gives us a console preview
# of which components are available
x_list_3
## -- The x-list for table `small_table`
## ---- ALL STEPS ----
## $time_start $time_end (POSIXct [1])
## $label $tbl_name $tbl_src $tbl_src_details (chr [1])
## $tbl (spec_tbl_df tbl_df tbl data.frame)
## $col_names $col_types (chr [8])
## $i $type $columns $values $label $briefs (mixed [10])
## $eval_error $eval_warning (lgl [10])
## $capture_stack (list [10])
## $n $n_passed $n_failed $f_passed $f_failed (num [10])
## $warn $stop $notify (lgl [10])
## $validation_set (tbl_df [10, 35])
## $lang (chr [1])
## $report_object (blastula_message)
## $report_html $report_html_small (chr [1])
## ----

The amount of information contained in here is comprehensive (see ?get_agent_x_list for a detailed breakdown) but we can provide a few examples.

The number of test units in each validation step.

x_list_3$n
##  [1]  1 13 13 13  1  1  1 13 13 13

The number of passing test units in each validation step.

x_list_3$n_passed
##  [1]  1 11  6 12  1  1  1 12  7 11

The fraction of passing test units in each validation step.

x_list_3$f_passed
##  [1] 1.00000 0.84615 0.46154 0.92308 1.00000 1.00000 1.00000 0.92308 0.53846
## [10] 0.84615

The warn, stop, and notify states. We can arrange that in a tibble and use the step numbers (i) as well.

dplyr::tibble(
  step   = x_list_3$i,
  warn   = x_list_3$warn,
  stop   = x_list_3$stop,
  notify = x_list_3$notify
)
## # A tibble: 10 × 4
##     step warn  stop  notify
##    <int> <lgl> <lgl> <lgl> 
##  1     1 FALSE FALSE FALSE 
##  2     2 TRUE  FALSE FALSE 
##  3     3 TRUE  TRUE  TRUE  
##  4     4 TRUE  FALSE FALSE 
##  5     5 FALSE FALSE FALSE 
##  6     6 FALSE FALSE FALSE 
##  7     7 FALSE FALSE FALSE 
##  8     8 FALSE FALSE FALSE 
##  9     9 TRUE  TRUE  TRUE  
## 10    10 TRUE  FALSE FALSE

Emailing the interrogation report with email_create()

We can choose to email the report if the notify state is entered. The message can be created with the agent through the email_create() function. Here’s a useful bit of code that allows for conditional sending.

if (any(x_list_3$notify)) {

  email_create(agent_3) %>%
    blastula::smtp_send(
      from = "sender@email.com",
      to = "recipient@email.com",
      credentials = creds_file(file = "email_secret")
    )
}

Such code might be useful during an automated process where data is periodically checked and failures beyond thresholds require notification.

While email_create() will generate the email message body, functions in the blastula package are responsible for the sending of that email. For more information on sending HTML email, look at the help article found by using ?blastula::smtp_send.

Customizing the interrogation report with get_agent_report()

We don’t have to fully accept the defaults for a data validation report. Using get_agent_report() gives us options.

Here’s how you can change the title:

agent_3 %>% get_agent_report(title = "The **3rd** Example")
The 3rd Example
Workshop agent No. 3

tibble small_tableWARN 0.15 STOP 0.25 NOTIFY 0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
1
col_is_posix
 col_is_posix()

date_time

1 1
1.00
0
0.00

2
col_vals_lt
 col_vals_lt()

a

7

13 11
0.85
2
0.15

3
col_vals_regex
 col_vals_regex()

b

^[0-9]-[a-w]{3}-[2-9]{3}$

13 6
0.46
7
0.54

4
col_vals_between
 col_vals_between()

d

[0, 4,000]

13 12
0.92
1
0.08

5
col_is_logical
 col_is_logical()

e

1 1
1.00
0
0.00

6
col_is_character
 col_is_character()

b

1 1
1.00
0
0.00

7
col_is_character
 col_is_character()

f

1 1
1.00
0
0.00

8
col_vals_lt
 col_vals_lt()

d

9600

13 12
0.92
1
0.08

9
col_vals_in_set
 col_vals_in_set()

f

low, mid

13 7
0.54
6
0.46

10
rows_distinct
 rows_distinct()

13 11
0.85
2
0.15

2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT

You can bring the steps that had serious failures up to the top:

agent_3 %>% get_agent_report(arrange_by = "severity")
Pointblank Validation
Workshop agent No. 3

tibble small_tableWARN 0.15 STOP 0.25 NOTIFY 0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
3
col_vals_regex
 col_vals_regex()

b

^[0-9]-[a-w]{3}-[2-9]{3}$

13 6
0.46
7
0.54

9
col_vals_in_set
 col_vals_in_set()

f

low, mid

13 7
0.54
6
0.46

2
col_vals_lt
 col_vals_lt()

a

7

13 11
0.85
2
0.15

4
col_vals_between
 col_vals_between()

d

[0, 4,000]

13 12
0.92
1
0.08

10
rows_distinct
 rows_distinct()

13 11
0.85
2
0.15

1
col_is_posix
 col_is_posix()

date_time

1 1
1.00
0
0.00

5
col_is_logical
 col_is_logical()

e

1 1
1.00
0
0.00

6
col_is_character
 col_is_character()

b

1 1
1.00
0
0.00

7
col_is_character
 col_is_character()

f

1 1
1.00
0
0.00

8
col_vals_lt
 col_vals_lt()

d

9600

13 12
0.92
1
0.08

2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT

You can remove those steps that had no failures:

agent_3 %>% get_agent_report(arrange_by = "severity", keep = "fail_states")
Pointblank Validation
Workshop agent No. 3

tibble small_tableWARN 0.15 STOP 0.25 NOTIFY 0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
3
col_vals_regex
 col_vals_regex()

b

^[0-9]-[a-w]{3}-[2-9]{3}$

13 6
0.46
7
0.54

9
col_vals_in_set
 col_vals_in_set()

f

low, mid

13 7
0.54
6
0.46

2
col_vals_lt
 col_vals_lt()

a

7

13 11
0.85
2
0.15

4
col_vals_between
 col_vals_between()

d

[0, 4,000]

13 12
0.92
1
0.08

10
rows_distinct
 rows_distinct()

13 11
0.85
2
0.15

2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT

You can change the language of the report:

agent_3 %>% get_agent_report(lang = "de")
Pointblank-Validierung
Workshop agent No. 3

tibble small_tableWARN 0,15 STOP 0,25 NOTIFY 0,35
SCHRITT SPALTEN WERTE TBL EVAL UNITS PASS FAIL W S N EXT
1
col_is_posix
 col_is_posix()

date_time

1 1
1,00
0
0,00

2
col_vals_lt
 col_vals_lt()

a

7

13 11
0,85
2
0,15

3
col_vals_regex
 col_vals_regex()

b

^[0-9]-[a-w]{3}-[2-9]{3}$

13 6
0,46
7
0,54

4
col_vals_between
 col_vals_between()

d

[0, 4.000]

13 12
0,92
1
0,08

5
col_is_logical
 col_is_logical()

e

1 1
1,00
0
0,00

6
col_is_character
 col_is_character()

b

1 1
1,00
0
0,00

7
col_is_character
 col_is_character()

f

1 1
1,00
0
0,00

8
col_vals_lt
 col_vals_lt()

d

9600

13 12
0,92
1
0,08

9
col_vals_in_set
 col_vals_in_set()

f

low, mid

13 7
0,54
6
0,46

10
rows_distinct
 rows_distinct()

13 11
0,85
2
0,15

2022-10-13 00:03:54 EDT < 1 s 2022-10-13 00:03:55 EDT

SUMMARY

  1. Data validation in pointblank requires the creation of an agent, validation functions, and an interrogation.
  2. The agent creates a report that tries to be informative and easily explainable.
  3. We can set data quality thresholds with action_levels(); there can be default DQ thresholds and step-specific thresholds (in both cases, supplied to actions).
  4. We can get data extracts pertaining to failing test units in rows of the input dataset (with get_data_extracts()).
  5. There is the option to obtain ‘sundered’ data, which is the input data split by whether cells contained failing test units (with get_sundered_data())
  6. A huge amount of validation data can be accessed with the get_agent_x_list() function (useful for programming with the validation results).
  7. We can create an email message using a specialized version of the validation report with email_create(); this integrates with the blastula R package.
  8. The report can be modified with get_agent_report().